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Abstract 

Motivation: Clustering is a frequently used con- 
cept in variety of bioinformatical applications. We 
present a new method for hierarchical clustering of 
data called mutual information clustering (MIC) al- 
gorithm. It uses mutual information (MI) as a sim- 
ilarity measure and exploits its grouping property: 
The MI between three objects X, Y, and Z is equal 
to the sum of the MI between X and Y, plus the MI 
between Z and the combined object {XY). 
Results: We use this both in the Shannon (prob- 
abilistic) version of information theory, where the 
"objects" are probability distributions represented 
by random samples, and in the Kolmogorov (algo- 
rithmic) version, where the "objects" are symbol se- 
quences. We apply our method to the construction of 
mammal phylogenetic trees from mitochondrial DNA 
sequences and we reconstruct the fetal ECG from the 
output of independent components analysis (ICA) 
applied to the ECG of a pregnant woman. 
Availability: The programs for estimation of MI 
and for clustering (probabilistic version) are available 
at http:/ /www. fz-juelich.de/nic/cs/software. 
Contact: a.kraskov@fz-juelich.de 

1 Introduction 

Classification or organizing of data is very important 
in all scientific disciplines. It is one of the most fun- 
damental mechanism of understanding and learning 
[Jain and Dubes, 1988| . Depending on the problem, 
classification can be exclusive or overlapping, super- 



vised or unsupervised. In the following we will be in- 
terested only in exclusive unsupervised classification. 
This type of classification is usually called clustering 
or cluster analysis. 

An instance of a clustering problem consist of a set 
of objects and a set of properties (called characteris- 
tic vector) for each object. The goal of clustering is 
the separation of objects into groups using only the 
characteristic vectors. Indeed, in general only cer- 
tain aspects of the characteristic vectors will be rel- 
evant, and extracting these relevant features is one 
field where mutual information (MI) plays a major 
role [Tishby et al., 1999| , but we shall not deal with 
this here. Cluster analysis organizes data either as 
a single grouping of individuals into non-overlapping 
clusters or as a hierarchy of nested partitions. The 
first approach is called partitional clustering (PC), 
the second one is hierarchical clustering (HC). One 
of the main features of HC methods is the visual im- 
pact of the dendrogram which enables one to see how 
objects are being merged into clusters. From any 
HC one can obtain a PC by restricting oneself to a 
"horizontal" cut through the dendrogram, while one 
cannot go in the other direction and obtain a full hier- 
archy from a single PC. Because of their wide spread 
of applications, there are a large variety of diff'erent 
clustering methods in use [Jain and Dubes, 1988| . 

The crucial point of all clustering algorithms is the 
choice of a proximity measure. This is obtained from 
the characteristic vectors and can be either an indi- 
cator for similarity (i.e. large for similar and small 
for dissimilar objects), or dissimilarity. In the latter 
case it is convenient but not obligatory if it satisfies 
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the standard axioms of a metric (positivity, symme- 
try, and triangle inequality). A matrix of all pair- 
wise proximities is called proximity matrix. Among 
HC methods one should distinguish between those 
where one uses the characteristic vectors only at the 
first level of the hierarchy and derives the proxim- 
ities between clusters from the proximities of their 
constituents, and methods where the proximities are 
calculated each time from their characteristic vectors. 
The latter strategy (which is used also in the present 
paper) allows of course for more flexibility but might 
also be computationally more costly. 

Quite generally, the "objects" to be clustered can 
be either single (finite) patterns (e.g. DNA se- 
quences) or random variables, i.e. probability dis- 
tributions. In the latter case the data are usually 
supplied in form of a statistical sample, and one of 
the simplest and most widely used similarity mea- 
sures is the linear (Pearson) correlation coefficient. 
But this is not sensitive to nonlinear dependencies 
which do not manifest themselves in the covariance 
and can thus miss important features. This is in 
contrast to mutual information (MI) which is also 
singled out by its information theoretic background 
[Cover and Thomas, 1991] . Indeed, MI is zero only if 
the two random variables are strictly independent. 

Another important feature of MI is that it 
has also an "algorithmic" cousin, defined within 
algorithmic (Kolmogorov) information theory 
|Li and Vitanyi, 1997| which measures the similarity 
between individual objects. For a thorough dis- 
cussion of distance measures based on algorithmic 
MI and for their application to clustering, see 
|Li et al , 200Tl|Li et al.,"2002| . 

Another feature of MI which is essential for the 
present application is its grouping property: The MI 
between three objects (distributions) X, Y, and Z is 
equal to the sum of the MI between X and Y, plus 
the MI between Z and the combined object (joint 
distribution) (XY), 

I{X,Y,Z)=I{X,Y)+I{{X,Y),Z). (1) 

Within Shannon information theory this is an exact 
theorem (see below), while it is true in the algorith- 
mic version up to the usual logarithmic correction 



terms |Li and Vitanyi, 1997| . Since X, Y, and Z can 
be themselves composite, Eq.Q can be used recur- 
sively for a cluster decomposition of MI. This moti- 
vates the main idea of our clustering method: instead 
of using e.g. centers of masses in order to treat clus- 
ters like individual objects in an approximative way, 
we treat them exactly like individual objects when 
using MI as proximity measure. 

More precisely, we propose the following scheme 
for clustering n objects with MIC: 

(1) Compute a proximity matrix based on pairwise 
mutual informations; assign n clusters such that each 
cluster contains exactly one object; 

(2) find the two closest clusters i and j; 

(3) create a new cluster (ij) by combining i and j; 

(4) delete the lines/columns with indices i and j from 
the proximity matrix, and add one line/column con- 
taining the proximities between cluster (ij) and all 
other clusters; 

(5) if the number of clusters is still > 2, goto (2); else 
join the two clusters and stop. 

In the next section we shall review the pertinent 
properties of MI, both in the Shannon and in the 
algorithmic version. This is applied in Sec. 3 to con- 
struct a phylogenetic tree using mitochondrial DNA 
and in Sec. 4 to cluster the output channels of an 
independent component analysis (ICA) of an elec- 
trocardiogram (ECG) of a pregnant woman, and to 
reconstruct from this the maternal and fetal ECGs. 
We finish with our conclusions in Sec. 6. 

2 Mutual Information 
2.1 Shannon Theory 

Assume that one has two random variables X and Y. 
If they are discrete, we write Pi{X) = prob(X = Xi), 
Pi{Y) = prob(F = Xi), and pij = prob(X = Xi,Y = 
yi) for the marginal and joint distribution. Otherwise 
(and if they have finite densities) we denote the den- 
sities by ^x{x), ^lyiv), and ^{x,y). Entropies are 
defined for the discrete case as usual by H{X) = 

-Y.^p,{x)\ogp,{x), H{Y) = -E,,K(y)iogp,(r), 

and H{X,Y) — ~J2ijPij^'^&Pij- Conditional en- 
tropies are defined as HiX\Y) = H{X, Y) - H{Y) = 
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— ^ijPij^ogPiy. The base of the logarithm deter- 
mines the units in which information is measured. 
In particular, taking base two leads to information 
measured in bits. In the following, we always will 
use natural logarithms. The MI between X and Y is 
finally defined as 



I{X,Y) 



H{X)+H{Y)-H{X,Y) 
^Pio log; 



^,3 



'p,{x)pdYy 



(2) 



It can be shown to be non-negative, and is zero only 
when X and Y are strictly independent. For n ran- 
dom variables Xi , X2 ■ ■ ■ Xn , the MI is defined as 

n 

I{Xi,.. .,Xn) = - H{Xi, . . .,Xn). (3) 

fe=l 

This quantity is often referred to as (generalized) re- 
dundancy, in order to distinguish it from different 
"mutual informations" which are constructed analo- 
gously to higher order cumulants, but we shall not 
follow this usage. Eq.J^I can be checked easily, to- 
gether with its generalization to arbitrary groupings. 
It means that MI can be decomposed into hierar- 
chical levels. By iterating it, one can decompose 
I{Xi . . . Xn) for any n > 2 and for any partitioning 
of the set {Xi . . . Xn) into the Mis between elements 
within one cluster and Mis between clusters. 

For continuous variables one first introduces some 
binning ('coarse-graining'), and applies the above to 
the binned variables. If a; is a vector with dimen- 
sion m and each bin has Lebesgue measure A, then 
Pi{X) « fj,x{x)A''" with X chosen suitably in bin i, 
and ^ 

H^,n{X)^HiX)-mlogA (4) 
where the differential entropy is given by 



HiX) 



dx fixix) log fix (x). 



(5) 



Notice that H-t^iniX) is a true (average) information 
and is thus non-negative, but H{X) is not an in- 
formation and can be negative. Also, H{X) is not 
invariant under homeomorphisms x — > (l){x). 

Notice that we have here assumed that densities really 
exists. If not e.g. if X lives on a fractal set), then m is to be 
replaced by the Hausdorff dimension of the measure ^. 



Joint entropies, conditional entropies, and MI are 
defined as above, with sums replaced by integrals. 
Like H(X), joint and conditional entropies are nei- 
ther positive (semi-)definite nor invariant. But MI, 
defined as 



I{X,Y) 



dxdy p.xY(x,y) log 



lJ'XY{x,y) 
fix{x)pY{y) ' 



(6) 

is non-negative and invariant under x —>■ (j){x) and 
y — > ipiy). It is (the limit of) a true information, 

I{X,Y) - hmJffbin(X) + Hy,UY) - HuniX,Y)]. 

In applications, one usually has the data avail- 
able in form of a statistical sample. To esti- 
mate I{X, Y) one starts from N bivariate measure- 
ments {xi,yi), i = 1, . . . N which are assumed to 
be iid (independent identically distributed) realiza- 
tions. There exist numerous algorithms to esti- 
mate I{X, Y) and entropies. We shall use in the 
following the MI estimators proposed recently in 
Ref. [Kraskov et al. , 2003, , and we refer to this pa- 
per for a review of alternative methods. 

2.2 Algorithmic Information Theory 

In contrast to Shannon theory where the basic objects 
are random variables and entropies are average infor- 
mations, algorithmic information theory deals with 
individual symbol strings and with the actual infor- 
mation needed to specify them. To "specify" a se- 
quence X means here to give the necessary input to 
a universal computer U, such that U prints X on its 
output and stops. The analogon to entropy, called 
here usually the complexity K{X) of X, is the mini- 
mal length of an input which leads to the output X, 
for fixed U. It depends on U, but it can be shown 
that this dependence is weak and can be neglected in 
the limit when K{X) is large |Li and Vitanyi, 1997| . 

Let us denote the concatenation of two strings X 
and Y as XY. Its complexity is K{XY). It is intu- 
itively clear that K{XY) should be larger than K{X) 
but cannot be larger than the sum K{X) + K{Y). 
Finally, one expects that K{X\Y), defined as the 
minimal length of a program printing X when Y is 
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furnished as auxiliary input, is related to K{XY) — 
K{Y). Indeed, one can show |Li and Vitanyi, 1997| 
(again within correction terms which become irrele- 
vant asymptotically) that 

0<K{X\Y)c::iK{XY)-K{Y)<K{X). (8) 

Notice the close similarity with Shannon entropy. 

The algorithmic information in Y about X is fi- 
nally defined as 



h,^{X,Y)=K{X)-K{X\Y) 



But d{X, Y) is not appropriate for our purposes. 
Since we want to compare the proximity between 
two single objects and that between two clusters 
containing maybe many objects, we would like the 
distance measure to be unbiased by the sizes of 
the clusters. As argued forcefully in |Li et al., 2001| 
|Li et al., 2002| , this is not true for /aig {X, Y) , and for 
the same reasons it is not true for I{X, Y) or d{X, Y) 
either: A mutual information of thousand bits should 
be considered as large, if X and Y themselves are 
]:{(^X)+K(Y) K{XYj^^^ thousand bits long, but it should be considered 
(9) 



Within the same additive correction terms, one shows 
that it is symmetric, /aig(X, Y) = ledgiY, X), and can 
thus serve as an analogon to mutual information. 

From the halting theorem it follows that K{X) 
is in general not computable. But one can easily 
give upper bounds. Indeed, the length of any in- 
put which produces X (e.g. by spelling it out ver- 
batim) is an upper bound. Improved upper bounds 
are provided by any file compression algorithm such 
as gnuzip or UNIX "compress". Good compression 
algorithms will give good approximations to K(X), 
and algorithms whose performance does not depend 
on the input file length (in particular since they do 
not segment the file during compression) will be cru- 
cial for the following. 

2.3 Mi-Based Distance Measures 

Mutual information itself is a similarity measure in 
the sense that small values imply large "distances" 
in a loose sense. But it would be useful to mod- 
ify it such that the resulting quantity is a metric in 
the strict sense, i.e. satisfies the triangle inequal- 
ity. Indeed, the first such metric is well known 
[Cover and Thomas, 1991] : The quantity 

d{X, Y) = H{X\Y) + H{Y\X) = H{X, Y) - I{X, Y) 

(10) 

satisfies the triangle inequality, in addition to be- 
ing non-negative and symmetric and to satisfying 
d{X, X) = 0. The proof proceeds by first showing 
that for any Z 

H{X\Y) < H{X, Z\Y) < H{X\Z) + H{Z\Y). (11) 



as very small, if X and Y would each be huge, say 
one million bits. 

As shown in |Li et al., 20011 |Li et al., 2002| within 
the algorithmic framework, one can form two differ- 
ent distances which measure relative distance, i.e. 
which are normalized by dividing by a total en- 
tropy. We sketch here only the theorems and proofs 
for the Shannon version, they are indeed very sim- 
ilar to their algorithmic analoga in |Li et al., 2001| 
|Li et al., 20*021 . 
Theorem 1: The quantity 

I{X,Y) 



D{X,Y) = l- 



d{X,Y) 



(12) 



H{X,Y) H{X,Y) 

and D{X, Y) < 1 for 



is a metric, with D{X,X) 
all pairs {X, Y) . 

Proof: Symmetry, positivity and boundedness are 
obvious. Since D{X,Y) can be written as 



DiX,Y) = 



H{X\Y) H{Y\X) 
H{X,Y) H{Y,X)' 



(13) 



it is sufficient for the proof of the triangle inequality 
to show that each of the two terms on the r.h.s. is 
bounded by an analogous inequality, i.e. 



HiX\Y) ^ H{X\Z) 



H{Z\Y) 



H{X,Y) - H{X,Z) H{Z,Y) 



(14) 



and similarly for the second term. Ea. H14() is proven 
straightforwardly, using Ea. (|ll() and the basic in- 
equalities H{X) > 0, H{X,Y) < H{X,Y,Z) and 
H{X\Z) > 0: 



H{X\Y) 
H{X,Y) 



H{X\Y) 



H{Y) + H{X\Y) 
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< 



< 



H{X\Z) + H{Z\Y) 
H{Y) + H{X\Z) + H{Z\Y) 
H{X\Z) + H{Z\Y) 
H{X\Z) + H{Y,Z) 

H{X\Z) H{Z\Y) 



H{X\Z) + H{Z) H{Y,Z) 
H{X\Z) H{Z\Y) 
H{X,Z) H{Z,Y)- 

Theorem 2: The quantity 

I{X,Y) 



(15) 



D'{X, Y) 



1 



ma.x{H{X),H{Y)} 
nmx{H{X\Y),H{Y\X)} 
mSix{H{X),H{Y)} 



(16) 



is also a metric, also with D'{X,X) = and 
D'{X,Y) < 1 for all pairs {X,Y). It is sharper than 
D, i.e. D'{X,Y) < D{X,Y). 

Proof: Again we have only to prove the triangle 
inequality, the other parts are trivial. For this we 
have to distinguish different cases |Li et a l., 2002 . 
Case 1: ma.x{H{Z), H{Y)} < H{X). Using Eq.(fTT| 
we obtain 

n'(YV\ - ^ H{X\Z) , H{Z\Y) 

-ii{xr-^i{xy^^i{Yy 

^ D'{X,Z) + D'{Z,Y). (17) 

Case 2: max{H{Z), H{X)} < H{Y). This is com- 
pletely analogous. 

Case 3: H{X) < H(Y) < H(Z). We now have to 
show that 

D'{X,Y) = < H{Y\Z) + H{Z\X) 



H{Y) 



H{Y) 



< D'{X,Z) + D'{Z,Y) 
H{Z\X) H{Z\Y) 
H{Z) ^ H{Z) ■ 



(18) 



Indeed, if the r.h.s. of the first line is less than 1, 
then 



H{Y\X) H{Y\Z)+H{Z\X) 

H{Y) - HiY) 

H{Y\Z)+H{Z\X) 



< 



H{Z) - HiY) 



H{Z\Y)+H{Z\X) 
H{Z) 



(19) 



H{Z) 



and Ea. (|18|) holds. If it is larger than 1, then also 
{H(Z\Y) + H(Z\X))/H{Z) > 1. Eq.lUHIl must now 
also hold, since H(Y\X)/H(Y) < 1. 
Case 4: H(Y) < H{X) < H{Z). This is completely 
analogous to case 3. 

Apart from scaling correctly with the total in- 
formation, in contrast to d{X,Y), the algorithmic 
analog to D{X,Y) and D'{X,Y) are also universal 
|Li et al., 2002] . Essentially this means that ii X KiY 
according to any non-trivial distance measure, then 
AT « y also according to Z), and even more so (by 
factor up to 2) according to D' . In contrast to the 
other properties of D and D', this is not easy to carry 
over from algorithmic to Shannon theory. The proof 
in Ref. |Li et al., 2002| depends on X and Y being 
discrete, which is obviously not true for probability 
distributions. Based on the universality argument, 
it was argued in |Li et al., 200 2' that D' should be 
superior to but the numerical studies shown in 
that reference did not show a clear difference between 
them. In the following we shall therefore use primar- 
ily D for simplicity, but we checked that using D' did 
not give systematically better results. 

A major difficulty appears in the Shannon frame- 
work, if we deal with continuous random variables. 
As we mentioned above. Shannon informations are 
only finite for coarse-grained variables, while they di- 
verge if the resolution tends to zero. This means that 
dividing MI by the entropy as in the definitions of D 
and D' becomes problematic. One has essentially two 
alternative possibilities. The first is to actually intro- 
duce some coarse-graining, although it would have 
been necessary for the definition of /(A, F), and di- 
vide by the coarse-grained entropies. This introduces 
an arbitrariness, since the scale A is completely ad 
hoc, unless it can be fixed by some independent argu- 
ments. We have found no such arguments, and thus 
we propose the second alternative. There we take 
A — > 0. In this case H{X) ^ m^, log A, with m^, be- 
ing the dimension of A. In this limit D and D' would 
tend to 1. But using similarity measures 

S{X,Y) = {l-D{X,Y))log{l/A), (20) 
5'(A,y) = (l-i?'(A,r))log(l/A) (21) 
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instead of D and D' gives exactly the same results in 
MIC, and 



S{X,Y) 



rrix + my 



S'{X,Y) 



I{X,Y) 



max{TOa;, TTLy} 

(22) 

Thus, when deahng with continuous variables, we di- 
vide the MI either by the sum or by the maximum 
of the dimensions. When starting with scalar vari- 
ables and when X is a cluster variable obtained by 
joining m elementary variables, then its dimension is 
just rrix — m. 



3 A Phylogenetic Tree for 
Mammals 

As a first application, we study the mitochondrial 
DNA of a group of 34 mammals (see Fig. 1). Ex- 
actly the s ame data |Gen, a| had previously b een an- 
alyzed in |Li et al., 2001} [Reyes et al., 2000| . This 
group includes among others'' some rodents'^, fer- 
ungulates"*, and primates^. It had been chosen in 
|Li et al., 2001| because of doubts about the relative 
closeness among these three groups |Cao et al., 1998| 
[Reyes et al, 2000| . 

Obviously, we are here dealing with the algorith- 
mic version of information theory, and informations 
are estimated by lossless data compression. For con- 
structing the proximity matrix between individual 



^opossum (Didelphis virginiana) , wallaroo (Macropus ro- 
bustus), and platypus {Ornithorhyncus anatinus) 

^rabbit {Oryctolagus cuniculus), guinea pig {Cavia porcel- 
lus), fat dormouse (Glis glis), rat (Rattus norvegicus), squirrel 
(Scuirus vulgaris), and mouse (Mus musculus) 

^horse {Equu caballus), donkey {Equus asinus), Indian 
rhinoceros {Rhinoceros unicornis), white rhinoceros [Cera- 
totherium simum), harbor seal {Phoca vitulina), grey seal 
(Halichoerus grypus), cat (Felis catus), dog (Canis familiaris), 
fin whale {Balenoptera physalus), blue whale {Balenoptera 
musculus), cow {Bos taurus), sheep {Ovis aries), pig {Sus 
scrofa), hippopotamus {Hippopotamus amphibius), neotropi- 
cal fruit bat {Artibeus jamaicensis), African elephant {Lox- 
odonta africana), aardvark {Orycteropus afer), and armadillo 
{Dasypus novemcintus) 

^human {Homo sapiens), common chimpanzee {Pan 
troglodytes), pigmy chimpanzee {Pan paniscus), gorilla {Go- 
rilla gorilla), orangutan {Pongo pygmaeus), gibbon {Hylobates 
lar), and baboon {Papio hamadryas) 
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Figure 1: Phylogenetic tree for 34 mammals (31 eu- 
therians plus 3 non-placenta mammals). In contrast 
to Fig. ^ the heights of nodes are equal to the dis- 
tances between the joining daughter clusters. 



taxa, we proceed essentially a in Ref. [Li et al., 2001] . 
But in addition to using the special compression 
program GenCompress [Gen, b[ , we also tested sev- 
eral general purpose compression programs such as 
BWTzip [BWT, [ and the UNIX tool bzip2. 

In Ref. [Li et al., 2001[ , this proximity matrix was 
then used as the input to a standard HC algorithm 
(neighbour-joining and hypercleaning) to produce an 
evolutionary tree. It is here where our treatment 
deviates crucially. We used the MIC algorithm de- 
scribed in Sec. 1, with distance D{X,Y). The join- 
ing of two clusters (the third step in the MIC algo- 
rithm) is obtained by simply concatenating the DNA 
sequences. There is of course an arbitrariness in the 
order of concatenation sequences: XY and YX give 
in general compressed sequences of different lengths. 
But we found this to have negligible effect on the 
evolutionary tree. The resulting evolutionary tree ob- 
tained with Gencompress is shown in Fig. ^ 

As shown in Fig. the overall structure of 
this tree closely resembles the one shown in 
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Ref. [Reyes et al., 2000] . All primates are correctly 
clustered and also the relative order of the ferungu- 
lates is in accordance with Ref. [Reyes et al., 2000| . 
On the other hand, there are a number of connections 
which obviously do not reflect the true evolutionary 
tree, see for example the guinea pig with bat and 
elephant with platypus. But the latter two, inspite 
of being joined together, have a very large distance 
from each other, thus their clustering just reflects 
the fact that neither the platypus nor the elephant 
have other close relatives in the sample. All in all, 
however, already the results shown in Fig. 1 capture 
surprisingly well the overall structure shown in Ref. 
[Reyes et al., 2000| . Dividing MI by the total infor- 
mation is essential for this success. If we had used 
the non- normalized Iaig{X,Y) itself, the clustering 
algorithm used in [Li et al., 2001[ would not change 
much, since all 34 DNA sequences have roughly the 
same length. But our MIC algorithm would be com- 
pletely screwed up: After the first cluster formation, 
we have DNA sequences of very different lengths, and 
longer sequences tend also to have larger MI, even if 
they are not closely related. 

A heuristic reasoning for the use of MIC for the re- 
construction of an evolutionary tree might be given 
as follows: Suppose that a proximity matrix has been 
calculated for a set of DNA sequences and the small- 
est distance is found for the pair {X,Y). Ideally, one 
would remove the sequences X and Y, replace them 
by the sequence of the common ancestor (say Z) of 
the two species, update the proximity matrix to find 
the smallest entry in the reduced set of species, and 
so on. But the DNA sequence of the common ances- 
tor is not available. One solution might be that one 
tries to reconstruct it by making some compromise 
between the sequences X and Y. Instead, we essen- 
tially propose to concatenate the sequences X and Y. 
This will of course not lead to a plausible sequence of 
the common ancestor, but it will optimally represent 
the information about the common ancestor. During 
the evolution since the time of the ancestor Z , some 
parts of its genome might have changed both in X 
and in Y . These parts are of little use in construct- 
ing any phylogenetic tree. Other parts might not 
have changed in either. They are recognized anyhow 
by any sensible algorithm. Finally, some parts of its 



genome will have mutated significantly in X but not 
in Y, and vice versa. This information is essential to 
find the correct way through higher hierarchy levels 
of the evolutionary tree, and it is preserved in con- 
catenating. 

4 Clustering of Minimally De- 
pendent Components in an 
Electrocardiogram 

As our second application we choose a case where 
Shannon theory is the proper setting. We show in 
Fig. 2 an ECG recorded from the abdomen and tho- 
rax of a pregnant woman [De Moor, 1997[ (8 chan- 
nels, sampling rate 500 Hz, 5 s total). It is already 
seen from this graph that there are at least two im- 
portant components in this ECG: the heartbeat of 
the mother, with a frequency of « 3 beat/s, and 
the heartbeat of the fetus with roughly twice this 
frequency. Both are not synchronized. In addition 
there is noise from various sources (muscle activity, 
measurement noise, etc.). While it is easy to detect 
anomalies in the mother's ECG from such a record- 
ing, it would be difhcult to detect them in the fetal 
ECG. 

As a first approximation we can assume that the 
total ECG is a linear superposition of several inde- 
pendent sources (mother, child, noisci, noise2,...). 
A standard method to disentangle such superpo- 
sitions is independent component analysis (ICA) 
[Hyvarinen et al., 200 1[ . In the simplest case one 
has n independent sources Si{t), i — l...n and n 
measured channels Xi{t) obtained by instantaneous 
superpositions with a time independent non-singular 
matrix A, 

n 

X^{t)=Y,A^JS,{t) . (23) 
J = l 

In this case the sources can be reconstructed by ap- 
plying the inverse transformation W = A^^ which is 
obtained by minimizing the (estimated) mutual infor- 
mations between the transformed components yi (t) = 
^ij^jW- If some of the sources are Gaussian, 
this leads to ambiguities [Hyvarinen et al., 2001[ , but 



7 



1 |-^^^|-^j'^A-'l|--^r^|JK^n|~4--|'^^ 



4 y|'-^(f--,-^j-'^^|n'-^p,.^^(^^'^"-*<"|r^^ 




1 2 seconds 3 4 5 

Figure 2: ECG of a pregnant woman. 

it gives a unique solution if the sources have more 
structure. 

In reality things are not so simple. For instance, 
the sources might not be independent, the number 
of sources (including noise sources!) might be dif- 
ferent from the number of channels, and the mixing 
might involve delays. For the present case this implies 
that the heartbeat of the mother is seen in several 
reconstructed components yi, and that the suppos- 
edly "independent" components are not independent 
at all. In particular, all components yi which have 
large contributions from the mother form a cluster 
with large intra-cluster Mis and small inter-cluster 
Mis. The same is true for the fetal ECG, albeit less 
pronounced. It is thus our aim to 

1) optimally decompose the signals into least depen- 
dent components; 

2) cluster these components hierarchically such that 
the most dependent ones are grouped together; 

3) decide on an optimal level of the hierarchy, such 
that the clusters make most sense physiologically; 

4) project onto these clusters and apply the inverse 
transformations to obtain cleaned signals for the 
sources of interest. 

Technically we proceeded as follows 
IStogbauer et al., 2004| : 

Since we expect different delays in the differ- 
ent channels, we first used Takens delay embedding 
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Figure 3: Least dependent components of the ECG 
shown in Fig.|21 after increasing the number of chan- 
nels by delay embedding. 
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[Takens, 19 80' with time delay 0.002 s and embed- 
ding dimension 3, resulting in 24 channels. We 
then formed 24 linear combinations yi{t) and deter- 
mined the de-mixing coefficients Wij by minimizing 
the overall mutual information between them, using 
the MI estimator proposed in [Kraskov et al., 2003| . 
There, two classes of estimators were introduced, one 
with square and the other with rectangular neigh- 
bourhoods. Within each class, one can use the num- 
ber of neighbours, called k in the following, on which 
the estimate is based. Small values of k lead to a 
small bias but to large statistical errors, while the 
opposite is true for large k. But even for very large 
k the bias is zero when the true MI is zero, and it is 
systematically such that absolute values of the MI are 
underestimated. Therefore this bias does not affect 
the determination of the optimal de-mixing matrix. 
But it depends on the dimension of the random vari- 
ables, therefore large values of k are not suitable for 
the clustering. We thus proceeded as follows: We first 
used k = 100 and square neighbourhoods to obtain 
the least dependent components yi{t), and then used 
k = 3 with rectangular neighbourhoods for the clus- 
tering. The resulting least dependent components are 
shown in Fig. |21 They are sorted such that the first 
components (1 - 5) are dominated by the mother's 
ECG, while the next three contain large contribu- 
tions from the fetus. The rest contains mostly noise, 
although some seem to be still mixed. 

These results obtained by visual inspection are 
fully supported by the cluster analysis. The den- 
drogram is shown in Fig. 0] In constructing it we 
used S{X, Y) (Eq.(|22Il) as similarity measure to find 
the correct topology. Again we would have obtained 
much worse results if we had not normalized it by 
dividing MI by mx + my- In plotting the actual 
dendrogram, however, we used the MI of the cluster 
to determine the height at which the two daughters 
join. The MI of the first five channels, e.g., is « 1.43, 
while that of channels 6 to 8 is « 0.34. For any two 
clusters (tuples) X ~ Xi . . . Xn and Y = Yi . . . y,„ 
one has I{X, Y) > I{X) + I{Y). This guarantees, if 
the MI is estimated correctly, that the tree is drawn 
properly. The two slight glitches (when clusters (1- 
14) and (15-18) join, and when (21-22) is joined with 
23) result from small errors in estimating MI. They 
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Figure 4: Dendrogram for least dependent compo- 
nents. The height where the two branches of a cluster 
join corresponds to the MI of the cluster. 

do in no way effect our conclusions. 

In Fig. 0] one can clearly see two big clusters cor- 
responding to the mother and to the child. There 
are also some small clusters which should be con- 
sidered as noise. For reconstructing the mother and 
child contributions to Fig. |21 we have to decide on 
one specific clustering from the entire hierarchy. We 
decided to make the cut such that mother and child 
are separated. The resulting clusters are indicated 
in Fig. 0] and were already anticipated in sorting the 
channels. Reconstructing the original ECG from the 
child components only, we obtain Fig. |5| 



5 Conclusions 

We have shown that MI can not only be used as a 
proximity measure in clustering, but that it also sug- 
gests a conceptually very simple and natural hierar- 
chical clustering algorithm. We do not claim that 
this algorithm, called mutual information clustering 
(MIC), is always superior to other algorithms. In- 
deed, MI is in general not easy to estimate. Obvi- 
ously, when only crude estimates are possible, also 
MIC will not give optimal results. But as MI esti- 
mates are becoming better, also the results of MIC 
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Figure 5: Original EGG where all contributions ex- 
cept those of the child cluster have been removed. 

should improve. The present paper was partly trig- 
gered by the development of a new class of MI 
estimators for continuous random variables which 
have very small bias and also rather small variances 
IKraskov et al., 2003| . 

We have illustrated our method with two appli- 
cations, one from genetics and one from cardiology. 
For neither application MIG might give the very best 
clustering, but it seems promising that one common 
method gives decent results for both, although they 
are very different. 

The results of MIC should improve, if more data 
become available. This is trivial, if we mean by that 
longer time sequences in the application to EGG, and 
longer parts of the genome in the application of Sec. 3. 
It is less trivial that we expect MIC to make fewer 
mistakes in a phylogenetic tree, when more species 
are included. The reason is that close-by species will 
be correctly joined anyhow, and families - which now 
are represented only by single species and thus are 
poorly characterized - will be much better described 
by the concatenated genomes if more species are in- 
cluded. 

There are two versions of information theory, al- 
gorithmic and probabilistic, and therefore there are 
also two variants of MI and of MIG. We discussed 
in detail one application of each, and showed that 
indeed common concepts were involved in both. In 



particular it was crucial to normalize MI properly, 
so that it is essentially the relative Ml which is used 
as proximity measure. For conventional clustering 
algorithms using algorithmic MI as proximity mea- 
sure this had already been stressed in |Li et al., 2001| 
|Li et al., 2002| , but it is even more important in MIC, 
both in the algorithmic and in the probabilistic ver- 
sions. 

In the probabilistic version, one studies the cluster- 
ing of probability distributions. But usually distribu- 
tions are not provided as such, but are given implic- 
itly by finite random samples drawn (more or less) 
independently from them. On the other hand, the 
full power of algorithmic information theory is only 
reached for infinitely long sequences, and in this limit 
any individual sequence defines a sequence of prob- 
ability measures on finite subsequences. Thus the 
strict distinction between the two theories is some- 
what blurred in practice. Nevertheless, one should 
not confuse the similarity between two sequences (two 
English books, say) and that between their subse- 
quence statistics. Two sequences are maximally dif- 
ferent if they are completely random, but their statis- 
tics for short subsequences is then identical (all sub- 
sequences appear in both with equal probabilities). 
Thus one should always be aware of what similarities 
or independencies one is looking for. The fact that 
MI can be used in similar ways for all these problems 
is not trivial. 

We would like to thank Arndt von Haesseler, Wal- 
ter Nadler and Volker Roth for many useful discus- 
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